1 Introduction

This document reports the code for discovering COVID-19 meta-subgroups by comorbidities and habits, evaluated on the Mexican Government dataset published by the Mexican government. This document, as well as the supporting functions required for its execution, are available in the COVID-19 Metaclustering GitHub repository.

The COVID-19 infectious disease has led since December 2019 to a global pandemic that remains under control measures. Researchers around the world are making great efforts for a comprehensive understanding of COVID-19 from various data sources and modalities, with the goal of improving prognosis and treatments.

This work shows the results of a two-stage unsupervised Machine Learning (ML) approach, namely meta-clustering, to identify potential subphenotypes of COVID-19 patients from clinical phenotypes and demographic features. Stratification on gender and age groups was included to reduce potential confounding factors since age and gender are highly correlated with comorbidity, habits and mortality. It is worth noting that although our meta-clustering approach in this tutorial is based on age-gender stratification, in real-world there are other intriguing variable combinations that we may be interested in that can also be replicated using this code with subtle changes. For example, considering the socioeconomic level of the city where the patients live, or separating rural and non-rural areas to obtain two different sets of clusters (rural clusters and non-rural clusters) since the socioeconomic level of the city affects the amount of resources that are available for patients and thereby affecting the outcome of patients.

This work is part of the COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool) project from the Biomedical Data Science Lab of the Universitat Politècnica de València, Spain.

We analyzed the COVID-19 open data from the Mexican Government that is released on 2020-11-02. After processing the data with proper inclusion criteria, the final sample comprises patient-level epidemiological and clinical data from 778 692 SARS-CoV-2 patients from January 13, 2020, to September 30, 2020. The methods for the dataset preprocessing is described in CovMexico_Data_Preprocessing.ipynb in our COVID-19 Metaclustering GitHub repository as well.

Inter-patient variability was analyzed by combining dimensionality reduction and hierarchical clustering methods. We produced cluster analyses for all combinations of gender and age groups (<18, 18-49, 50-64, and >64). For each group, the optimum number of clusters in this work were selected combining a quantitative approach using Silhouette coefficient, and a qualitative method through a subgroup expert inspection via visual analytics. Using the features of the resultant age-gender clusters, we performed a meta-clustering analysis to provide an overall description of the population.

In this tutorial, we found 56 age-gender specific clusters with a Multiple Correspondence Analysis 3-dimensional and a hierarchical clustering. Afterwards, we applied PCA with these age-gender clusters and then used the first 3 Principal Components as input to apply meta-clustering. Finally, we found eleven meta-clusters that provide bases to comprehend classification of patients with COVID-19 based on comorbidities, habits, demographic characteristics, geographic data and type of clinical institutions, as well as revealing the correlations between the above characteristics thereby help to anticipate the possible clinical outcomes for every specifically characterized patient. These subphenotypes can establish target groups for automated stratification or triage systems to provide personalized therapies or treatments.

All the codes can be replicated easily and adapted on many different kinds of datasets such as patient-level or individual-level dataset.

If you use this code please cite:

Lexin Zhou, Nekane Romero, Juan Martínez-Miranda, J Alberto Conejero, Juan M García-Gómez, Carlos Sáez. Heterogeneity in COVID-19 severity patterns among age-gender groups: an analysis of 778 692 Mexican patients through a meta-clustering technique. Preprint submitted to medRxiv.

If you are interested in collaborating in this work please contact us.

For further exploration of the results, please visit our COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool). For more practical examples, you may also be interested in our preceding work about COVID-19 subgroup discovery tool, application to the nCov2019 datase whose methodology is well described in another GitHub repository COVID-19-SDE-Tool.

2 Setup

Install and load the required packages

# the function p_load() from pacman package checks to see if a package is installed, if not it attempts to install the package and then loads it
# By this way, we simplify the notebook's structure
# install.packages("pacman")   
library(pacman)
pacman::p_load(writexl, ggplot2, ggfortify, text2vec, plotly, factoextra, dplyr, stringr, kableExtra, MVN, lsa, Rtsne, FactoMineR, ape, RColorBrewer, survival, survminer, varhandle, glue, cluster, fastcluster, graphics, clusterCrit, ComplexHeatmap, reshape, pheatmap)

Source the required functions. These can be found at the COVID-19 Metaclustering GitHub repository. We recommended to download the whole repository, which includes this document .Rmd file.

source('./R/pCI.R')
source('./R/mCI.R')
source('./R/Prop.R')
source('./R/Mean.R')

3 Data loading

Load the nCov2019 dataset at 2020-11-02.

### LOAD AND FORMAT DATASET 
untar('./data/COVID19_MEXICO_LATEST_DATA.tar.gz', exdir = "data")
data = read.csv2("./data/COVID19_MEXICO_LATEST_DATA.csv", encoding="UTF-8", sep = ",", header = TRUE, na.strings = "", stringsAsFactors = FALSE,dec=".",
                 colClasses = c( "numeric",   #Index of row
                                 "Date",      #LAST_UPDATE
                                 "character", #ID_REGISTRATION
                                 "factor",    #ORIGIN
                                 "factor",    #SECTOR
                                 "character", #ENTITY_UM
                                 "factor",    #SEX
                                 "character", #ENTITY_NAT
                                 "character", #ENTITY_RES
                                 "character", #MUNICIPALITY_RES
                                 "factor",    #PATIENT_TYPE
                                 "Date",      #ADMISSION_DATE
                                 "Date",      #SYMPTOMS_DATE
                                 "Date",      #DEATH_DATE
                                 "numeric",   #INTUBATED
                                 "numeric",   #PNEUMONIA
                                 "numeric",   #AGE
                                 "character", #NATIONALITY
                                 "numeric",   #PREGNANT
                                 "numeric",   #SPEAK_INDIGENOUS_LANGUAGE
                                 "numeric",   #Indigenous
                                 "numeric",   #DIABETES
                                 "numeric",   #COPD
                                 "numeric",   #ASTHMA
                                 "numeric",   #INMUSUPR
                                 "numeric",   #HYPERTENSION
                                 "numeric",   #OTHER_DISEASE
                                 "numeric",   #CARDIOVASCULAR
                                 "numeric",   #OBESITY 
                                 "numeric",   #CHRONIC_KIDNEY_DISEASE
                                 "numeric",   #SMOKE
                                 "factor",    #OTHER_CASE_CONTACT
                                 "factor",    #TESTED
                                 "factor",    #RESULT_LAB
                                 "factor",    #FINAL_CLASIFICATION
                                 "factor",    #MIGRANT
                                 "character", #COUNTRY_NATIONALITY
                                 "character", #COUNTRY_ORIGIN
                                 "factor",    #UCI
                                 "factor",    #OUTCOME ('See the Python notebook')
                                 "numeric",   #Survival_days ('See the Python notebook')
                                 "character"  #From_symptoms_to_hospital_days ('See the Python notebook')
                 ))  

4 Data preparation

4.1 Convert and derive variables.

# Select only positive patients
data = data[data$RESULT_LAB == 'Positive SARS-CoV-2', ]
# data = data[data$RESULT_LAB == 'Non-Positive SARS-CoV-2', ]

# convert some variable types
data$ageNum = as.numeric(data$AGE)
data$Survival_days = as.numeric(data$Survival_days)
data$FromSymptomToHospital_days = as.numeric(data$FromSymptomToHospital_days)
data$ageCat = vector(mode = "character", length = nrow(data))
names(data)[names(data) == "UCI"] <- "ICU" # Translate UCI (Spanish) to ICU (Intensive Care Unit)

# make age groups <17, 18-49, 50-64, >65
idsLeq17  = data$ageNum <= 17
ids18to49 = data$ageNum > 17 & data$ageNum <= 49
ids50to64 = data$ageNum > 49 & data$ageNum <= 64
idsGeq65  = data$ageNum > 64
data$ageCat[idsLeq17] = '<18'
data$ageCat[ids18to49] = '18-49'
data$ageCat[ids50to64] = '50-64'
data$ageCat[idsGeq65] = '>64'

# Manually fix variable types
data$ADMISSION_DATE = as.Date(data$ADMISSION_DATE)
data$LAST_UPDATE = as.Date(data$LAST_UPDATE)
data$SYMPTOMS_DATE = as.Date(data$SYMPTOMS_DATE)
data$DEATH_DATE = as.Date(data$DEATH_DATE)

# remove rows with missing Admission dates and remove index column
data = data[!is.na(data$ADMISSION_DATE),-1]

# remove rows: LAST_UPDATE and ID_REGISTRATION due to their irrelevance
data = subset(data, select = -c(LAST_UPDATE, ID_REGISTRATION))

# Create new columns for categorizing survival days
data$'Survival>15days' <- ifelse(data$Survival_days>15, 1, 0)
data$'Survival>30days' <- ifelse(data$Survival_days>30, 1, 0)

data$'Survival>15days'[is.na(data$'Survival>15days')]<-1 # Fill nan-value
data$'Survival>30days'[is.na(data$'Survival>30days')]<-1

data_outcome = data[which(!is.na(data$OUTCOME)),]

data_outcome = data_outcome[!is.na(data_outcome$ageNum),]

# creates binary variables for age range.
age_rangeVector <- to.dummy(data$ageCat, "Age")
data_outcome = cbind(data_outcome, age_rangeVector)
names(data_outcome)[names(data_outcome) == "Age.<18"] <- "Age<18"
names(data_outcome)[names(data_outcome) == "Age.18-49"] <- "Age18-49"
names(data_outcome)[names(data_outcome) == "Age.50-64"] <- "Age50-64"
names(data_outcome)[names(data_outcome) == "Age.>64"] <- "Age>64" 

5 Analysis of 56 age-gender specific clusters

5.1 Create the empty vectors for meta-clustering dataframe

### It's worth noting that these vectors should be changed if the user is interested in replicating the analysis with different dataset with different variables.
name <- c()
clusterSize <- c()
Age<- c()
Gender <- c()
Pregnant <- c()
Obesity <- c()
Smoke <- c()

Pneumonia <- c()
Diabetes <- c()
COPD <- c()
Asthma <- c()
INMUSUPR <- c()
Hypertension <- c()
Other_Disease <- c()
Cardiovascular <- c()
CKD <- c()

Recovery <- c()
Survival_days <- c()
FifteenSurvival <- c()           # For all patients
ThirtySurvival <- c()            # For all patients
FifteenSurvival_deceased <- c()  # Only deceased patients
ThirtySurvival_deceased <- c()   # Only deceased patients
Symptoms_to_hospitalization_days <- c()
Hospitalized <- c()
ICU <- c()
Intubated <- c()
Other_case_contact <- c()
Age_mean <- c()

5.2 Select the proper K for each age-gender specific clustering

You may consider to change the number k for your own analysis by the following chunks. In this tutorial we select different number of k for different age-gender groups.

ages >64 years:

Bestk = list()
Bestk[['>64']] = list()
Bestk[[">64"]][["Male"]] = 8
Bestk[[">64"]][["Female"]] = 8

ages between 50-64 years:

Bestk[['50-64']] = list()
Bestk[['50-64']][["Male"]] = 9
Bestk[['50-64']][["Female"]] = 8

ages between 18-49 years:

Bestk[["18-49"]] = list()
Bestk[["18-49"]][["Male"]] = 7
Bestk[["18-49"]][["Female"]] = 7

ages < 18 years:

Bestk[["<18"]] = list()
Bestk[["<18"]][["Male"]] = 5
Bestk[["<18"]][["Female"]] = 4

Create List for the Silhouette values in all age-gender groups (not used in this tutorial) since we only used them to illustrate Silhouette Plot in our COVID-19 Subgroup Discovery and Exploration Tool (COVID-19 SDE Tool)

SilhoutteL = list()
SilhoutteL[[">64"]] = list()
SilhoutteL[["50-64"]] = list()
SilhoutteL[["18-49"]] = list()
SilhoutteL[["<18"]] = list()

5.3 Obtain all the age-gender specific clusters

this may take a long time depending on the sample size. Our dataset (n = 778 692) lasted roughly 3 hours

Genders <- c("Male", "Female") # M=Male, F=Female
Age_cat <- c("<18", "18-49", "50-64", ">64")

for (ageRange in Age_cat){
  for (sex in Genders) {
    
    ### Filtering the data and do some modification in the data frame for MCA usage
    dataExperiment = data_outcome[data_outcome$ageCat == ageRange,]
    dataExperiment = dataExperiment[dataExperiment$SEX == sex,]
    bestk = Bestk[[ageRange]][[sex]]
    # print(ageRange)
    # print(sex)
    
    ### Some preparation for MCA (Multiple Correspondence Analysis)
    # vector for comorbidities
    chronicsVector = dataExperiment %>% select(13, 19,20,21,22,23,24, 25, 27)    
    # vector for habits
    featuresVector = dataExperiment %>% select(26,28)                            
    
    featuresVector = data.frame(featuresVector)
    colnames(featuresVector) = paste0("F_",colnames(featuresVector))
    
    chronicsVector = data.frame(chronicsVector)
    colnames(chronicsVector) = paste0("C_",colnames(chronicsVector))
    
    data_analysis = cbind(featuresVector,chronicsVector)
    data_analysis = sapply(data_analysis, as.logical)
    
    
    ### Select the variables that we sought to analyze
    data_analysis_metadata = dataExperiment[,c("ageNum", "ageCat", "SEX", "ENTITY_RES","NATIONALITY","SMOKE", "SECTOR","PREGNANT","OUTCOME","ICU","INTUBATED","SECTOR","OBESITY","OTHER_CASE_CONTACT","Survival_days","PATIENT_TYPE","FromSymptomToHospital_days","Age<18","Age18-49","Age50-64","Age>64",'Survival>15days','Survival>30days')]
    
    
    
    ### Perform Multiple Correspondence Analysis on 3 dimensions
    res.mca = MCA(data_analysis, ncp = 3, graph = TRUE)
    # fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 45))
    ind = res.mca$ind
    var = res.mca$var
    
    ### Perform clustering
    clusterdistEuclideanMCA <- hclust.vector(ind$coord, method="ward", members=NULL, metric='euclidean', p=NULL) # ward is equivalent to ward.D2 in the library fastcluster
    groups <- cutree(clusterdistEuclideanMCA, k=bestk)
    sizes <- dataExperiment$ageNum
    resultsClustering = list("ind" = ind, "clusterdistEuclideanMCA" = clusterdistEuclideanMCA, "k" = bestk, "groups" = groups, "sizes" = sizes)
    
    ### Calculate Silhouette Coefficients for each age-gender cluster
    Silhouette <- intCriteria(ind$coord, resultsClustering$groups, c("Silhouette"))
    SilhoutteL[[ageRange]][[sex]] = Silhouette
    
    ### selection of true habits and comorbidities values from MCA
    res.mca$call$marge.col
    trues = endsWith(names(res.mca$call$marge.col), "TRUE")
    coord_T = var$coord[trues,]
    rownames(coord_T) = substr(rownames(coord_T),1,nchar(rownames(coord_T))-5)
    coord_T_F = coord_T[1:ncol(featuresVector),]
    rownames(coord_T_F) = substr(rownames(coord_T_F),3,nchar(rownames(coord_T_F)))
    coord_T_C = coord_T[-(1:ncol(featuresVector)),]
    rownames(coord_T_C) = substr(rownames(coord_T_C),3,nchar(rownames(coord_T_C)))
    
    ### Calculate statistics
    uniqueGroups = unique(resultsClustering$groups)
    # change a bit the colnames' format (optional)
    colnames(data_analysis) = substr(colnames(data_analysis),3,nchar(colnames(data_analysis)))
    colnames(data_analysis) = make.names(colnames(data_analysis), unique = TRUE)
    colnames(data_analysis) = str_replace_all(colnames(data_analysis),"_", " ")
    
    for (i in 1:length(uniqueGroups)){
      
      if(sex=='Male') {
        sex_abbre = 'M'
      } else {
        sex_abbre = 'F'
      }
      
      name <- append(name, paste(ageRange, sex_abbre,i, sep="")) # The name of the age-gender cluster. I.e: '18-49M7'
      
      patientGroupIdx = resultsClustering$groups %in% uniqueGroups[i]
      nPatientsGroup = sum(patientGroupIdx)
      clusterSize <- append(clusterSize, nPatientsGroup) # The 'n' of the cluster
      
      # comorbidity statistics
      data_analysis_subgroup = data_analysis[patientGroupIdx,,drop = FALSE]
      nind = nrow(data_analysis_subgroup)
      data_analysis_subgroupT = t(data_analysis_subgroup)
      resultsS = sapply(data.frame(data_analysis_subgroup), function(x) Prop(x)*100) ### The proportion results of input variables
      Obesity <- append(Obesity, resultsS[1])
      Smoke <- append(Smoke, resultsS[2])
      Pneumonia <- append(Pneumonia, resultsS[3])
      Diabetes <- append(Diabetes, resultsS[4])
      COPD <- append(COPD, resultsS[5])
      Asthma <- append(Asthma, resultsS[6])
      INMUSUPR <- append(INMUSUPR, resultsS[7])
      Hypertension <- append(Hypertension, resultsS[8])
      Other_Disease <- append(Other_Disease, resultsS[9])
      Cardiovascular <- append(Cardiovascular, resultsS[10])
      CKD <- append(CKD, resultsS[11])
      Age<- append(Age, ageRange)
      
      # sex, age, recovered statistics
      data_analysis_metadata_subgroup = data_analysis_metadata[patientGroupIdx,]
      
      
      # Age range statistics
      Age17Stats = Prop(data_analysis_metadata_subgroup$'Age<18' == 1)*100
      Age18Stats = Prop(data_analysis_metadata_subgroup$'Age18-49' == 1)*100
      Age50Stats = Prop(data_analysis_metadata_subgroup$'Age50-64' == 1)*100
      Age65Stats = Prop(data_analysis_metadata_subgroup$'Age>64' == 1)*100
      
      # Gender and pregnancy statistics
      femaleStats = Prop(as.character(data_analysis_metadata_subgroup$SEX) %in% 'Female')*100
      PregnantStats = Prop(data_analysis_metadata_subgroup$PREGNANT == 1)*100
   
      # ageMean = mean(data_analysis_metadata_subgroup$ageNum)
      ageStats = Mean(data_analysis_metadata_subgroup$ageNum)
      
      # Outcome statistics
      recoveredStats = Prop(data_analysis_metadata_subgroup$OUTCOME == 'Non-Deceased')*100
      survivalStats = Mean(data_analysis_metadata_subgroup$Survival_days[data_analysis_metadata_subgroup$OUTCOME == 'Deceased'], na.rm = TRUE)
      Survival15Stats = Prop(data_analysis_metadata_subgroup$'Survival>15days' == 1)*100
      Survival30Stats = Prop(data_analysis_metadata_subgroup$'Survival>30days' == 1)*100
      
      aux <- data_analysis_metadata_subgroup[data_analysis_metadata_subgroup$OUTCOME == "Deceased",]
      Survival15Stats_deceased = Prop(aux$'Survival>15days' == 1)*100
      Survival30Stats_deceased = Prop(aux$'Survival>30days' == 1)*100
      
      SympToHostStats = Mean(data_analysis_metadata_subgroup$FromSymptomToHospital_days, na.rm = TRUE)
      ICUStats = Prop(data_analysis_metadata_subgroup$ICU == 1)*100
      intubStats = Prop(data_analysis_metadata_subgroup$INTUBATED == 1)*100
      Patient_TypeStats = Prop(data_analysis_metadata_subgroup$PATIENT_TYPE == 'HOSPITALIZED')*100
      Case_ContactStats = Prop(data_analysis_metadata_subgroup$OTHER_CASE_CONTACT == 1)*100
      
      ### Append the Stats to vectors
      Pregnant <- append(Pregnant, PregnantStats)
      Age_mean <- append(Age_mean, ageStats)
      Recovery <- append(Recovery, recoveredStats)
      Survival_days <- append(Survival_days, survivalStats)
      
      FifteenSurvival <- append(FifteenSurvival, Survival15Stats)
      ThirtySurvival <- append(ThirtySurvival, Survival30Stats)
      FifteenSurvival_deceased <- append(FifteenSurvival_deceased, Survival15Stats_deceased)
      ThirtySurvival_deceased <- append(ThirtySurvival_deceased, Survival30Stats_deceased)
      
      Symptoms_to_hospitalization_days <- append(Symptoms_to_hospitalization_days, SympToHostStats)
      Hospitalized <- append(Hospitalized, Patient_TypeStats)
      ICU <- append(ICU, ICUStats)
      Intubated <- append(Intubated, intubStats)
      Other_case_contact <- append(Other_case_contact, Case_ContactStats)
      Gender <- append(Gender, sex)
    }
  }
}

5.4 Save the previous results in a dataframe

df <- data.frame(name, Obesity, Smoke, Pneumonia, Diabetes, COPD, Asthma, INMUSUPR,
                 Hypertension, Other_Disease, Cardiovascular, CKD, clusterSize, Age, Gender,
                 Pregnant, Recovery, Survival_days, FifteenSurvival, ThirtySurvival, FifteenSurvival_deceased, ThirtySurvival_deceased, Symptoms_to_hospitalization_days,
                 Hospitalized, ICU, Intubated, Other_case_contact, Age_mean)

### Each row comprises the calculated features of each age-gender group
rownames(df) <- df$name  

data_analysis_metadata = df

5.5 Apply PCA analysis for resultant 56 age-gender specific clusters with 2D plot

# vector for comorbidities
chronicsVector = data_analysis_metadata %>% dplyr::select(4,5,6,7,8,9,10,11,12)
# vector for habits
featuresVector = data_analysis_metadata %>% dplyr::select(2,3)

featuresVector = data.frame(featuresVector)
colnames(featuresVector) = paste0("F_",colnames(featuresVector))

chronicsVector = data.frame(chronicsVector)
colnames(chronicsVector) = paste0("C_",colnames(chronicsVector))

data_analysis = cbind(featuresVector,chronicsVector) # Will be used later in clustering

pca_df <- data_analysis_metadata[2:12]
pca<-(prcomp(pca_df)) # Apply PCA algorithm

### make a scree plot (loadings)
a = fviz_screeplot(pca, addlabels = TRUE, ylim = c(0, 45))

### plot pc1 and pc2

p <- autoplot(pca, data=data_analysis_metadata, colour="Age",frame=TRUE,loadings = TRUE, loadings.colour = 'blue', label = TRUE, label.size = 5, main='2D PCA from 56 COVID-19 subgroups',
         loadings.label = TRUE, loadings.label.size = 5)+
  theme_bw()

6 Meta-Clustering

6.1 Perform hierarchical clustering and cut the resultant tree using the best number of clusters k.

### Analyze the Silhouette Coefficients with plot for meta-clustering
SilhouetteCoefficients = c()
for (i in 2:12){
  bestk = i
  ind = pca$x[, 1:3] # equivalent to ind$coord
  clusterdistEuclideanPCA = hclust.vector(ind, method="ward", members=NULL, metric='euclidean', p=NULL) #Equivalent to hclust but faster with vector data.
  groups <- cutree(clusterdistEuclideanPCA, k=bestk)
  sizes <- data_analysis_metadata$Age
  resultsClustering = list("ind" = pca$x[,1:3], "clusterdistEuclideanPCA" = clusterdistEuclideanPCA, "k" = bestk, "groups" = groups, "sizes" = sizes)
  a = intCriteria(pca$x[,1:3], resultsClustering$groups, c("Silhouette"))
  SilhouetteCoefficients <- append(SilhouetteCoefficients, a[[1]])
  }
plot<-plot(SilhouetteCoefficients, type='o',col = "blue", xlab = "Number of clusters (k)", ylab = "Silhouette Coefficient",
   main = "Silhouette Coefficient Plot")

### We select k=11 for meta-clustering
bestk = 11
ind = pca$x[, 1:3] # equivalent to ind$coord in MCA
clusterdistEuclideanPCA = hclust.vector(ind, method="ward", members=NULL, metric='euclidean', p=NULL) # using Ward’s minimum variance method with Euclidean squared distance
groups <- cutree(clusterdistEuclideanPCA, k=bestk)
sizes <- data_analysis_metadata$Age
resultsClustering = list("ind" = pca$x[,1:3], "clusterdistEuclideanPCA" = clusterdistEuclideanPCA, "k" = bestk, "groups" = groups, "sizes" = sizes)

Perform first some preparation to facilitate visualizations.

# selection of true symptoms and comorbidities values from MCA
trues = endsWith(names(res.mca$call$marge.col), "TRUE")
coord_T = var$coord[trues,]
rownames(coord_T) = substr(rownames(coord_T),1,nchar(rownames(coord_T))-5)
coord_T_F = coord_T[1:ncol(featuresVector),]
rownames(coord_T_F) = substr(rownames(coord_T_F),3,nchar(rownames(coord_T_F)))
coord_T_C = coord_T[-(1:ncol(featuresVector)),]
rownames(coord_T_C) = substr(rownames(coord_T_C),3,nchar(rownames(coord_T_C)))

# configuration of scatter properties
markerSize2d = 35
markerSize2ddiamond = 25
markerSize3d = 300
markerSize3ddiamond = 250
axx <- list(
  title = "1<sup>st</sup> component"
  
)
axy <- list(
  title = "2<sup>nd</sup> component"
)
axz <- list(
  title = "3<sup>rd</sup> component"
)

### Hide grid in order to obtain high resolution plot without harsh grid lines after exporting to html file
axx_no_grid <- list(
  title = "1<sup>st</sup> component", showgrid = FALSE
  
)
axy_no_grid <- list(
  title = "2<sup>nd</sup> component", showgrid = FALSE
)

# color resources
colorsVars = brewer.pal(n = 2, name = "BrBG")

6.2 Results visualizations

In the following results, the information about the subgroups is consistent within scatter plots, detailed table, pattern discovery using LOESS model, and correspond to the subgroups found in the previous application of clustering to the selected Age-gender group with the corresponding number of clusters k.

6.2.1 COVID-19 case embeddings (2D and 3D)

Subgroups scatter

is.num <- sapply(data_analysis_metadata, is.numeric)
data_analysis_metadata[is.num] <- lapply(data_analysis_metadata[is.num], round, 2)

ax <- list(
  title = "",
  showgrid = FALSE
)

pClusters2d <- plot_ly(x = pca$x[,1], y = pca$x[,2],
              color = as.factor(paste("Subgroup",format(resultsClustering$groups, digits=2))),
              colors = brewer.pal(n = bestk, name = "Set3"),
              size=7,
              text = data_analysis_metadata$name,
              marker = list(sizemode = 'area'),
              scene = 'sceneClusters') %>%
  layout(title = "Subgroups (obtained at 3D)",
         xaxis = ax, 
         yaxis = ax) %>% 
  config(displaylogo = FALSE)

t <- list(
  family = "sans serif",
  size = 15,
  showgrid=FALSE,
  color = toRGB("grey50"))

pClusters2d <- pClusters2d %>% add_text(textfont = t)

pClusters2d
pClusters3d <- plot_ly(x = pca$x[,1], y = pca$x[,2], z = pca$x[,3],
              color = as.factor(paste("Subgroup",format(resultsClustering$groups, digits=2))),
              colors = brewer.pal(n = length(unique(resultsClustering$groups)), name = "Set2"),
              size = I(markerSize3d), type = "scatter3d", mode = "markers",
              text = paste0("Patient ID: ",rownames(pca$x),"\n Age:",resultsClustering$sizes),
              marker = list(sizemode = 'area'),
              scene = 'sceneClusters') %>%
  layout(title = "Subgroups", scene = list(xaxis=axx,yaxis=axy,zaxis=axz)) %>% 
  config(displaylogo = FALSE)

t <- list(
  family = "sans serif",
  size = 8,
  color = toRGB("grey50"))

pClusters3d

6.2.2 Summary Table regarding 11 meta-clusters

alphaci = 0.05
uniqueGroups = unique(resultsClustering$groups)
resultsTable = data.frame(matrix(nrow = ncol(data_analysis)+19, ncol = length(uniqueGroups)))
colnames(data_analysis) = substr(colnames(data_analysis),3,nchar(colnames(data_analysis)))
colnames(data_analysis) = make.names(colnames(data_analysis), unique = TRUE)
colnames(data_analysis) = str_replace_all(colnames(data_analysis),"_", " ")

# Note: this (whihout make.names) returns an error when same text is in symptoms and comorbidities, a solution might be using a column for names, or avoid repated terms
rownames(resultsTable) <- c(sprintf('No. of individuals (n<sub>total</sub> = %d)',nrow(data_analysis)), colnames(data_analysis),"Age<17","Age18-49","Age50-64","Age>65",'PREGNANT', 'Females','Age','Recovered','Survival days (deceased)','Survival>15days','Survival>30days', 'Survival>15days_deceased','Survival>30days_deceased','Symptoms to hospitalization days','Hospitalized','ICU','INTUBATED','Other case contact')
colnames(resultsTable) <- paste('Subgroup',uniqueGroups)

for (i in 1:length(uniqueGroups)){
  
  patientGroupIdx = resultsClustering$groups %in% uniqueGroups[i]
  nPatientsGroup = sum(patientGroupIdx)
  
  # comorbidity statistics
  data_analysis_subgroup = data_analysis[patientGroupIdx,,drop = FALSE]
  nind = nrow(data_analysis_subgroup)
  data_analysis_subgroupT = t(data_analysis_subgroup)
  resultsS = sapply(data.frame(data_analysis_subgroup), function(x) mCI(x, alphaci))
  subgroupResultColumn = apply(resultsS, 2, function(x) sprintf('%.2f (%.2f-%.2f)',x[1],x[2],x[3]))
  
  # sex, age, recovered statistics
  data_analysis_metadata_subgroup = data_analysis_metadata[patientGroupIdx,]
  
  #Age range stats
  Age17Stats = pCI(data_analysis_metadata_subgroup$Age == '<=17', alphaci)*100
  Age18Stats = pCI(data_analysis_metadata_subgroup$Age == '18-49', alphaci)*100
  Age50Stats = pCI(data_analysis_metadata_subgroup$Age == '50-64', alphaci)*100
  Age65Stats = pCI(data_analysis_metadata_subgroup$Age == '>=65', alphaci)*100

  femaleStats = pCI(data_analysis_metadata_subgroup$Gender=='Female', alphaci)*100
  PregnantStats = mCI(data_analysis_metadata_subgroup$Pregnant, alphaci)

  ageStats = mCI(data_analysis_metadata_subgroup$Age_mean, alphaci)
  recoveredStats = mCI(data_analysis_metadata_subgroup$Recovery, alphaci)
  survivalStats = mCI(data_analysis_metadata_subgroup$Survival_days, alphaci, na.rm = TRUE)
  SympToHostStats = mCI(data_analysis_metadata_subgroup$Symptoms_to_hospitalization_days, alphaci, na.rm = TRUE)
  ICUStats = mCI(data_analysis_metadata_subgroup$ICU, alphaci)
  intubStats = mCI(data_analysis_metadata_subgroup$Intubated, alphaci)
  Patient_TypeStats = mCI(data_analysis_metadata_subgroup$Hospitalized, alphaci)
  Case_ContactStats = mCI(data_analysis_metadata_subgroup$Other_case_contact, alphaci)
  Survival15Stats = mCI(data_analysis_metadata_subgroup$FifteenSurvival,alphaci, na.rm = TRUE)                     ### Both deceased and non-deceased patients
  Survival30Stats = mCI(data_analysis_metadata_subgroup$ThirtySurvival,alphaci, na.rm = TRUE)
  Survival15_deceasedStats = mCI(data_analysis_metadata_subgroup$FifteenSurvival_deceased,alphaci, na.rm = TRUE)   ### Percetages among deceased patients
  Survival30_deceasedStats = mCI(data_analysis_metadata_subgroup$ThirtySurvival_deceased,alphaci, na.rm = TRUE)
  
  # compile final result column
  subgroupResultColumn = c(as.character(nPatientsGroup), subgroupResultColumn, sprintf('%.2f (%.2f-%.2f)',Age17Stats[1],Age17Stats[2],Age17Stats[3]), sprintf('%.2f (%.2f-%.2f)',Age18Stats[1],Age18Stats[2],Age18Stats[3]), sprintf('%.2f (%.2f-%.2f)',Age50Stats[1],Age50Stats[2],Age50Stats[3]), sprintf('%.2f (%.2f-%.2f)',Age65Stats[1],Age65Stats[2],Age65Stats[3]), sprintf('%.2f 
                                                                                                                                                                                                                                                                                                                                                                                   (%.2f-%.2f)', PregnantStats[1], PregnantStats[2],PregnantStats[3]),sprintf('%.2f (%.2f-%.2f)',femaleStats[1],femaleStats[2],femaleStats[3]), sprintf('%.2f (%.2f-%.2f)',ageStats[1],ageStats[2],ageStats[3]), sprintf('%.2f (%.2f-%.2f)',recoveredStats[1],recoveredStats[2],recoveredStats[3]))
  
  subgroupResultColumn = c(subgroupResultColumn, sprintf('%.2f (%.2f-%.2f)',survivalStats[1],survivalStats[2],survivalStats[3]),
  sprintf('%.2f (%.2f-%.2f)',Survival15Stats[1],Survival15Stats[2],Survival15Stats[3]),
  sprintf('%.2f (%.2f-%.2f)',Survival30Stats[1],Survival30Stats[2],Survival30Stats[3]),
  sprintf('%.2f (%.2f-%.2f)',Survival15_deceasedStats[1],Survival15_deceasedStats[2],Survival15_deceasedStats[3]),
  sprintf('%.2f (%.2f-%.2f)',Survival30_deceasedStats[1],Survival30_deceasedStats[2],Survival30_deceasedStats[3]),
  sprintf('%.2f (%.2f-%.2f)',SympToHostStats[1],SympToHostStats[2],SympToHostStats[3]))
  
  subgroupResultColumn = c(subgroupResultColumn,sprintf('%.2f (%.2f-%.2f)',Patient_TypeStats[1],Patient_TypeStats[2],Patient_TypeStats[3]), sprintf('%.2f (%.2f-%.2f)',ICUStats[1],ICUStats[2],ICUStats[3]),sprintf('%.2f (%.2f-%.2f)',intubStats[1],intubStats[2],intubStats[3]), sprintf('%.2f (%.2f-%.2f)',Case_ContactStats[1],Case_ContactStats[2],Case_ContactStats[3]))
  
  resultsTable[i] <- subgroupResultColumn
}

# resultsTable$variables <- rownames(resultsTable)
# write_xlsx(resultsTable,"C:\\Users\\Lexin Zhou\\11Meta_subgroups_table.xlsx")

# names(resultsTable) <- cell_spec(names(resultsTable), background = "yellow")
tTable = kable(resultsTable, format = "html", escape = F) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F) %>%
  pack_rows("Features (%|x, CI 95%)", 2, 1+ncol(featuresVector)) %>%
  pack_rows("Comorbidities (%|x, CI 95%)", 2+ncol(featuresVector), 2+ncol(featuresVector)+ncol(chronicsVector)-1) %>%
  pack_rows("Age range (%|x, CI 95%)", 2+ncol(featuresVector)+ncol(chronicsVector), 2+ncol(featuresVector)+ncol(chronicsVector)+3)%>%
  pack_rows("Demographics (%|x, CI 95%)", 6+ncol(featuresVector)+ncol(chronicsVector), 2+ncol(featuresVector)+ncol(chronicsVector)+6) %>%
  pack_rows("Outcomes (%|x, CI 95%)", 2+ncol(featuresVector)+ncol(chronicsVector)+7, 2+ncol(featuresVector)+ncol(chronicsVector)+7+8+2) 


groupColors = brewer.pal(n = ncol(resultsTable), name = "Set3")

for (i in 1:ncol(resultsTable)) {
  tTable = tTable %>% column_spec(i+1, background = groupColors[i], include_thead = TRUE) %>%
    column_spec(i+1, background = "inherit")
}

tTable
Subgroup 1 Subgroup 2 Subgroup 3 Subgroup 4 Subgroup 5 Subgroup 6 Subgroup 7 Subgroup 8 Subgroup 9 Subgroup 10 Subgroup 11
No. of individuals (ntotal = 56) 8 6 8 8 3 7 4 3 5 3 1
Features (%|x, CI 95%)
Obesity 0.44 (-0.42-1.29) 11.77 (1.77-21.77) 59.87 (37.37-82.38) 12.01 (4.36-19.67) 75.54 (58.77-92.30) 18.89 (14.98-22.81) 20.15 (11.99-28.30) 19.05 (3.44-34.66) 25.94 (-1.77-53.66) 50.51 (39.69-61.34) 23.99 (NA-NA)
Smoke 0.00 (0.00-0.00) 9.68 (-0.96-20.31) 34.09 (11.26-56.92) 0.80 (-0.77-2.36) 10.77 (2.19-19.36) 8.10 (4.96-11.24) 4.38 (0.40-8.37) 38.03 (10.93-65.13) 0.22 (-0.22-0.67) 42.02 (21.36-62.68) 76.85 (NA-NA)
Comorbidities (%|x, CI 95%)
Pneumonia 12.36 (0.06-24.67) 37.00 (10.24-63.76) 9.08 (1.88-16.28) 37.18 (28.28-46.08) 41.52 (29.64-53.40) 42.44 (33.60-51.28) 52.14 (49.38-54.90) 43.55 (35.90-51.21) 48.10 (42.38-53.83) 46.80 (34.44-59.15) 53.61 (NA-NA)
Diabetes 0.00 (0.00-0.00) 4.42 (-0.68-9.52) 4.50 (1.09-7.91) 39.06 (20.82-57.29) 57.14 (46.70-67.59) 35.62 (26.64-44.60) 76.44 (72.73-80.15) 20.45 (9.76-31.14) 95.00 (88.20-101.79) 61.22 (48.65-73.80) 31.96 (NA-NA)
COPD 0.00 (0.00-0.00) 4.51 (-0.36-9.38) 0.00 (0.00-0.00) 0.73 (-0.70-2.16) 0.00 (0.00-0.00) 5.10 (1.29-8.91) 2.03 (-0.36-4.42) 43.91 (26.55-61.28) 2.36 (-2.27-7.00) 37.46 (18.06-56.87) 91.86 (NA-NA)
Asthma 0.37 (-0.35-1.09) 3.20 (1.00-5.40) 18.17 (4.07-32.28) 1.15 (-0.33-2.63) 2.03 (0.84-3.22) 2.69 (1.72-3.65) 0.49 (-0.34-1.32) 25.72 (5.32-46.11) 0.08 (-0.08-0.25) 19.79 (6.44-33.14) 19.63 (NA-NA)
INMUSUPR 0.00 (0.00-0.00) 13.03 (-1.95-28.01) 0.10 (-0.10-0.31) 1.40 (-1.34-4.13) 0.00 (0.00-0.00) 40.38 (26.73-54.03) 0.00 (0.00-0.00) 0.91 (-0.88-2.70) 0.00 (0.00-0.00) 0.03 (-0.00-0.06) 0.00 (NA-NA)
Hypertension 0.00 (0.00-0.00) 9.13 (1.05-17.21) 7.59 (0.71-14.48) 41.15 (29.72-52.59) 68.79 (55.80-81.77) 46.79 (37.71-55.88) 83.71 (79.07-88.35) 34.38 (28.21-40.55) 96.33 (90.32-102.34) 77.86 (60.66-95.06) 52.94 (NA-NA)
Other Disease 0.00 (0.00-0.00) 38.32 (14.19-62.46) 0.30 (-0.28-0.88) 1.22 (-0.85-3.28) 0.00 (0.00-0.00) 48.63 (31.41-65.84) 1.85 (-0.45-4.15) 1.73 (-1.66-5.13) 0.00 (0.00-0.00) 0.82 (-0.37-2.02) 0.00 (NA-NA)
Cardiovascular 0.00 (0.00-0.00) 17.52 (6.00-29.04) 0.10 (-0.10-0.31) 2.46 (0.06-4.86) 2.17 (-2.08-6.41) 14.25 (8.52-19.99) 21.64 (5.73-37.54) 4.73 (-1.62-11.07) 5.52 (-5.30-16.33) 26.51 (25.30-27.73) 27.77 (NA-NA)
CKD 0.00 (0.00-0.00) 4.27 (-2.34-10.87) 0.00 (0.00-0.00) 3.87 (-3.69-11.42) 0.22 (-0.21-0.65) 31.84 (12.44-51.23) 81.68 (67.22-96.14) 1.04 (-0.84-2.91) 1.92 (-1.84-5.68) 1.28 (0.67-1.90) 1.01 (NA-NA)
Age range (%|x, CI 95%)
Age<17 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00)
Age18-49 25.00 (0.00-55.01) 33.33 (0.00-71.05) 50.00 (15.35-84.65) 25.00 (0.00-55.01) 66.67 (13.32-100.00) 28.57 (0.00-62.04) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00)
Age50-64 25.00 (0.00-55.01) 0.00 (0.00-0.00) 37.50 (3.95-71.05) 25.00 (0.00-55.01) 33.33 (0.00-86.68) 42.86 (6.20-79.52) 50.00 (1.00-99.00) 33.33 (0.00-86.68) 40.00 (0.00-82.94) 33.33 (0.00-86.68) 0.00 (0.00-0.00)
Age>65 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00) 0.00 (0.00-0.00)
Demographics (%|x, CI 95%)
PREGNANT 0.49 (-0.19-1.17) 1.28 (-0.03-2.59) 0.30 (-0.09-0.70) 0.80 (-0.24-1.83) 0.33 (-0.31-0.97) 0.26 (-0.25-0.77) 0.01 (-0.01-0.03) 0.00 (0.00-0.00) 0.01 (-0.00-0.01) 0.00 (0.00-0.00) 0.00 (NA-NA)
Females 50.00 (15.35-84.65) 50.00 (9.99-90.01) 50.00 (15.35-84.65) 50.00 (15.35-84.65) 33.33 (0.00-86.68) 42.86 (6.20-79.52) 50.00 (1.00-99.00) 33.33 (0.00-86.68) 60.00 (17.06-100.00) 66.67 (13.32-100.00) 0.00 (0.00-0.00)
Age 43.41 (26.04-60.77) 17.98 (5.69-30.27) 39.80 (29.24-50.36) 44.76 (27.37-62.15) 46.37 (36.38-56.35) 56.34 (45.22-67.47) 65.32 (56.46-74.17) 68.74 (57.36-80.12) 66.80 (59.25-74.34) 68.25 (57.47-79.03) 76.43 (NA-NA)
Outcomes (%|x, CI 95%)
Recovered 90.27 (80.69-99.85) 91.37 (88.49-94.25) 95.21 (90.75-99.67) 82.81 (73.99-91.63) 81.30 (72.82-89.78) 66.94 (56.11-77.77) 53.96 (47.85-60.08) 66.43 (57.01-75.85) 64.95 (57.34-72.56) 64.42 (47.39-81.44) 55.96 (NA-NA)
Survival days (deceased) 13.61 (12.88-14.35) 12.83 (11.49-14.17) 14.22 (13.72-14.71) 13.26 (12.22-14.31) 13.33 (12.93-13.73) 12.67 (12.01-13.34) 11.84 (11.47-12.20) 13.46 (12.91-14.00) 12.94 (12.50-13.37) 13.27 (12.78-13.76) 12.10 (NA-NA)
Survival>15days 93.46 (86.96-99.95) 93.73 (91.36-96.10) 97.01 (94.27-99.75) 88.39 (82.57-94.21) 87.27 (81.71-92.82) 76.34 (68.52-84.16) 65.37 (60.62-70.11) 77.09 (70.86-83.32) 75.26 (69.45-81.08) 75.34 (63.34-87.35) 67.28 (NA-NA)
Survival>30days 90.74 (81.61-99.87) 91.80 (89.06-94.54) 95.50 (91.30-99.69) 83.74 (75.27-92.21) 82.14 (74.04-90.24) 68.26 (57.71-78.80) 55.71 (49.70-61.71) 68.20 (59.28-77.11) 66.33 (58.87-73.79) 65.87 (49.36-82.39) 56.96 (NA-NA)
Survival>15days_deceased 30.76 (26.80-34.72) 28.64 (23.96-33.33) 36.21 (33.89-38.52) 31.09 (25.90-36.27) 31.59 (30.22-32.96) 28.45 (26.02-30.89) 24.80 (23.33-26.27) 31.70 (28.72-34.68) 29.73 (27.83-31.63) 31.01 (28.71-33.31) 25.71 (NA-NA)
Survival>30days_deceased 6.61 (4.50-8.71) 4.64 (1.88-7.41) 5.93 (3.77-8.09) 5.79 (3.02-8.56) 4.52 (4.22-4.83) 4.20 (3.42-4.99) 3.82 (3.35-4.29) 5.26 (5.02-5.49) 4.04 (3.32-4.75) 4.24 (3.53-4.94) 2.29 (NA-NA)
Symptoms to hospitalization days 3.78 (2.90-4.66) 3.20 (2.22-4.17) 4.87 (4.46-5.27) 4.37 (3.65-5.09) 5.21 (4.94-5.47) 4.48 (4.18-4.78) 4.30 (4.12-4.49) 4.85 (4.69-5.02) 4.92 (4.81-5.04) 4.94 (4.70-5.17) 4.82 (NA-NA)
Hospitalized 19.87 (5.44-34.30) 46.08 (25.71-66.44) 14.15 (5.71-22.60) 42.22 (34.68-49.75) 44.91 (34.80-55.03) 58.56 (47.64-69.48) 70.72 (67.24-74.20) 57.17 (47.81-66.53) 60.80 (55.26-66.34) 60.11 (43.18-77.03) 70.47 (NA-NA)
ICU 1.59 (0.41-2.77) 9.82 (3.74-15.89) 1.23 (0.27-2.20) 4.48 (3.34-5.61) 5.06 (3.32-6.80) 4.01 (3.20-4.82) 4.87 (4.04-5.69) 4.81 (3.69-5.92) 5.56 (4.86-6.26) 5.24 (3.25-7.23) 5.62 (NA-NA)
INTUBATED 3.44 (0.30-6.58) 9.03 (4.72-13.34) 2.18 (0.31-4.06) 7.90 (5.63-10.18) 8.46 (5.20-11.72) 12.12 (8.74-15.50) 13.38 (12.30-14.45) 11.50 (9.14-13.86) 12.13 (10.29-13.96) 12.42 (6.55-18.29) 12.84 (NA-NA)
Other case contact 45.84 (36.85-54.84) 40.23 (33.11-47.36) 51.18 (46.11-56.25) 36.60 (32.48-40.71) 36.04 (32.92-39.16) 27.39 (22.47-32.32) 20.90 (18.49-23.31) 27.88 (24.16-31.60) 27.56 (24.86-30.26) 28.00 (21.06-34.94) 20.89 (NA-NA)

6.2.3 Heatmap describing the probability of studied variables among 56 age-gender cluters

The heatmap also specifies the meta-cluster that each age-gender cluster belongs

df1 = subset(data_analysis_metadata, select = -c(name, Age, Gender, clusterSize, Survival_days)) 
df1 <- df1 %>% 
  dplyr::rename(
    "Survival>15days" = FifteenSurvival,
    "Survival>30days" = ThirtySurvival,
    "Survival>15days (deceased)" = FifteenSurvival_deceased,
    "Survival>30days (deceased)" = ThirtySurvival_deceased
    )

# make ClusterSize groups <17, 18-49, 50-64, >65
less500  = df$clusterSize < 500
less2000 = df$clusterSize < 2000 & df$clusterSize >= 500
less5000 = df$clusterSize < 5000 & df$clusterSize >= 2000
less10000 = df$clusterSize < 10000 & df$clusterSize >= 5000
less30000 = df$clusterSize < 30000 & df$clusterSize >= 10000
more30000 = df$clusterSize > 30000

aux <- df
aux$clusterSizeCat[less500] = '<500'
aux$clusterSizeCat[less2000] = '500-1999'
aux$clusterSizeCat[less5000] = '2K-4999'
aux$clusterSizeCat[less10000] = '5K-9999'
aux$clusterSizeCat[less30000] = '10K-29999'
aux$clusterSizeCat[more30000] = '>=30000'

Clus_Size <- aux$clusterSizeCat
names(Clus_Size) <- df$name
ClusterGroup <- resultsClustering$groups 
ClusterGroup <- as.factor(ClusterGroup)
names(ClusterGroup) <- df$name

### Create annotation rows
my_annotation_row = data.frame(
                   Cluster_size = Clus_Size,
                   Cluster_subgroup = ClusterGroup
                   )

### Create annotation rows' colors
my_colour = list(
    Cluster_size = brewer.pal(9, "YlGn")[1:6],
    Cluster_subgroup = brewer.pal(n = 12, name = "Set3")[1:bestk]
    )

names(my_colour$Cluster_size) <- c('<500','500-1999','2K-4999','5K-9999','10K-29999','>=30000')
names(my_colour$Cluster_subgroup) <- c(1:bestk)

### Order the dataframe to group by the meta-clusters
df1$cluster <- resultsClustering$groups
df1 <- df1 %>% arrange(cluster)  
df1$cluster <- NULL  # Remove it because we do not want it appears twice (as numeric variables)

### Plot
### For the following pheatmap function, it is worth noting that we did not use "display_numbers = T" parameter to illustrate the exact figure (probability) of each box since the heatmap is too small which made it impossible to visualize. When user uses this tutorial we recommend to include this parameter if the user wants to generate the heatmap with numbers.
final_Heat2 <- pheatmap(df1, name='The remain variables', main='HeatMap of the 56 age-gender specific clusters among 11 meta-subgroups', angle_col = c('315'), cluster_cols=F,
         annotation_row = my_annotation_row,
         annotation_colors = my_colour, cluster_rows = F)

final_Heat2

### Redo final_Heat2 by the order of clusters' age (optional)
target <- c("<18F1","<18F2","<18F3","<18F4",
                      "<18M1","<18M2","<18M3","<18M4","<18M5",
                      "18-49F1","18-49F2","18-49F3","18-49F4","18-49F5","18-49F6","18-49F7",
                      "18-49M1","18-49M2","18-49M3","18-49M4","18-49M5","18-49M6","18-49M7",
                      "50-64F1","50-64F2","50-64F3","50-64F4","50-64F5","50-64F6","50-64F7","50-64F8",
                      "50-64M1","50-64M2","50-64M3","50-64M4","50-64M5","50-64M6","50-64M7","50-64M8","50-64M9",
                      ">64F1",">64F2",">64F3",">64F4",">64F5",">64F6",">64F7",">64F8",
                      ">64M1",">64M2",">64M3",">64M4",">64M5",">64M6",">64M7",">64M8") 

df2 <- df1[match(target, rownames(df1)),]

### Use display_numbers = T if we want to show the exact number of each box
final_Heat3 <- pheatmap(df2, name='The remain variables', main='HeatMap of the 56 age-gender specific clusters among 11 meta-subgroups', angle_col = c('315'), cluster_cols=F,cluster_rows = F)
final_Heat3

6.3 Pattern discovery among eleven meta-clusters (56 age-gender clusters) with LOESS models

6.3.1 Define generateLOESS function

generateLOESS <- function(pca, Y, Y_name) {
  # pca is the result of pca whose PC1 and PC2 will be used to estimate  severity variable that we want to predict
  # Y is the vector that comprises the values of severity variable (e.g. data_analysis_metadata$Mortality) 
  # Y_name is just the name of variable Y. We will use this for legend purpose.
  df <- data.frame(pca$x[,1], pca$x[,2], Y) 
  colnames(df) <- c('pca1','pca2','YVal')
  data.loess <- loess(YVal ~ pca1 + pca2, data = df)  ### Predicting Y using PC1 and PC2
  
  graph_reso <- 0.5
  xgrid <- seq(min(df$pca1), max(df$pca1), by = graph_reso)
  ygrid <- seq(min(df$pca2), max(df$pca2), by = graph_reso)
  
  data.fit <- expand.grid(pca1 = xgrid,
                             pca2 = ygrid)
  mtrx3d <-  predict(data.loess, newdata = data.fit)

  # plot0 <- contour(x = xgrid, y = ygrid, z = mtrx3d, xlab = "pca1", ylab = "pca2")
  # Transform data to long form
  mtrx.melt <- melt(mtrx3d, id.vars = c("pca1", "pca2"), measure.vars = YVal)
  names(mtrx.melt) <- c("pca1", "pca2", 'YVal')
  
  ### Eliminate negative values
  mtrx.melt$YVal <- ifelse(mtrx.melt$YVal < 0, 0, mtrx.melt$YVal)

  # Return data to numeric form
  mtrx.melt$pca1 <- as.numeric(str_sub(mtrx.melt$pca1, str_locate(mtrx.melt$pca1, "=")[1,1] + 1))
  mtrx.melt$pca2 <- as.numeric(str_sub(mtrx.melt$pca2, str_locate(mtrx.melt$pca2, "=")[1,1] + 1))

  # Create ten segments to be colored in
  mtrx.melt$equalSpace <- cut(mtrx.melt$YVal, 7)
  # Sort the segments in ascending order
  breaks <- levels(unique(mtrx.melt$equalSpace))
  # Plot
  plot3 <- ggplot(mtrx.melt, aes(x=pca1,
                  y=pca2, z=YVal)) +
         geom_tile(data = mtrx.melt, aes( x=pca1,
                  y=pca2, fill = equalSpace)) +
         geom_point(data = df, aes(x = pca1, y = pca2), shape = 1, size =8, color='red')+ 
         # geom_text(data=df, aes(label=Y_name),hjust=0, vjust=0, size=5)+
         geom_contour(color = "white", alpha = 0.5,size=0) +
         theme_bw() + 
         theme(panel.grid.major = element_blank(),
         panel.grid.minor = element_blank()) +
         xlab("PCA1") +
         ylab("PCA2") +
         scale_fill_manual(values = c("#80cdc1", "#c7eae5", "#f5f5f5",
                                     "#f6e8c3", "#dfc27d", "#bf812d", "#8c510a"),
                           name = Y_name, breaks = breaks, labels = breaks)
  return (plot3)
}

6.3.2 Visualization of the patterns

We studied Mortality, ICU, Intubation, Hospitalization, symptoms to hospital, Survival15, Survival30,Survival15_deceased, Survival30_deceased.

### Create Mortality variable based on Recovery rate
data_analysis_metadata$Mortality <- 100 - data_analysis_metadata$Recovery
# Mortality
generateLOESS(pca, data_analysis_metadata$Mortality, 'Mortality')

# ICU
generateLOESS(pca, data_analysis_metadata$ICU, 'ICU')

# Intubation
generateLOESS(pca, data_analysis_metadata$Intubated, 'Intubation')

# Hospitalization
generateLOESS(pca, data_analysis_metadata$Hospitalized, 'Hospitalization')

# Symptoms_to_hospitalization_days
generateLOESS(pca, data_analysis_metadata$Symptoms_to_hospitalization_days, 'Symptoms_to_hospitalization_days')

# Survival>15days (%)
generateLOESS(pca, data_analysis_metadata$FifteenSurvival, 'Survival>15days (%)')

# Survival>30days (%)
generateLOESS(pca, data_analysis_metadata$ThirtySurvival, 'Survival>30days (%)')

# Survival>15days_deceased (%)
generateLOESS(pca, data_analysis_metadata$Intubated, 'Survival>15days_deceased (%)')

# Survival>30days_deceased (%)
generateLOESS(pca, data_analysis_metadata$ThirtySurvival_deceased, 'Survival>30days_deceased (%)')